Closes #113 - Add Chebi (Chapti) #525


Open

napsternxg wants to merge 7 commits into bigscience-workshop:main from napsternxg:chebi

Contributor

napsternxg commented Apr 28, 2022 (edited)

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Fixes #113

If the following information is NOT present in the issue, please populate:

Name: chebi
Description: http://chebi.cvs.sourceforge.net/viewvc/chebi/chapati/
Paper: https://europepmc.org/articles/PMC2238832
Data: http://chebi.cvs.sourceforge.net/viewvc/chebi/chapati/

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

napsternxg and others added 3 commits

April 11, 2022 16:26


          Fixes bigscience-workshop#113 - Add Chebi (Chapti)

66090e3


          Updated info. Waiting on questions.

66a2609


          Added working code for ChEBI

6afa559

napsternxg requested review from debajyotidatta, galtay, hakunanatasha, jason-fries, leonweber, ruisi-su, sg-wbi and sunnnymskang as code owners

April 28, 2022 13:39

napsternxg mentioned this pull request

Create dataset loader for CHEBI (Chapati) #113

Open

sg-wbi changed the title ~~Fixes #113 - Add Chebi (Chapti)~~ Closes #113 - Add Chebi (Chapti)

mariosaenger self-assigned this

Mario Sänger added 3 commits

October 26, 2024 09:11


          Merge branch 'main' into chebi

5185fa4


          refactor: Refactor implementation of chebi corpus to new hub-style

b4cf675


          fix: Remove init files for consistency

mariosaenger requested a review from phlobo

October 26, 2024 07:32

Collaborator

mariosaenger commented Oct 26, 2024

@phlobo Please have a look at the implementation.


          fix: Update README.md

44c413e

phlobo requested changes


Collaborator

phlobo left a comment

@mariosaenger I have some questions regarding this dataset, could you please have a look?

bigbio/hub/hub_repos/chebi/README.md

+              issn = {0305-1048},
+              pages = {D344—50},
+              url = {https://europepmc.org/articles/PMC2238832},
+              biburl = {https://aclanthology.org/W19-5008.bib},

Collaborator

phlobo Dec 6, 2024

Looks like biburl and bibsource belong to a different dataset

bigbio/hub/hub_repos/chebi/chebi.py

+              issn = {0305-1048},
+              pages = {D344—50},
+              url = {https://europepmc.org/articles/PMC2238832},
+              biburl = {https://aclanthology.org/W19-5008.bib},

Collaborator

phlobo Dec 6, 2024

Looks like biburl and bibsource belong to a different dataset

bigbio/hub/hub_repos/chebi/chebi.py

+              DATA_URL = "https://github.com/bigscience-workshop/biomedical/files/8568960/PatentAnnotations_GoldStandard.tar.gz"
+              _URLS = {
+                  # The original dataset is hosted on CVS on sourceforge. Hence I have downloaded and reuploded it as tar.gz format.

Collaborator

phlobo Dec 6, 2024

The provenance of this dataset seems tricky. It is not mentioned in the original ChEBI publication in NAR, which we use as the citation. Is there any other source of information about the annotation project? Otherwise, it is hard to tell whether we got the number of annotations, IDs, etc. right.

bigbio/hub/hub_repos/chebi/chebi.py

+                  # Converted via the following command:
+                  # cvs -z3 -d:pserver:[email protected]:/cvsroot/chebi co \
+                  #   chapati/patentsGoldStandard/PatentAnnotations_GoldStandard.tgz
+                  # mkdir -p ./MoNERo

Collaborator

phlobo Dec 6, 2024

Does the MoNERo part belong to a different dataset?

bigbio/hub/hub_repos/chebi/chebi.py

+                              "offsets": [[e["start"], e["end"]]],
+                              "type": e["attrs"]["type"],
+                              "normalized": [
+                                  {"db_name": "chebi", "db_id": chebi_id.strip()} for chebi_id in e["attrs"]["chebi-id"].split(",")

Collaborator

phlobo Dec 6, 2024

The IDs extracted this way look inconsistent, e.g.:

  {'id': 'WO2007000651-E8',
   'type': 'CHEMICAL',
   'text': ['Zinc oxide'],
   'offsets': [[613, 623]],
   'normalized': [{'db_name': 'chebi', 'db_id': 'CHEBI:36560'}]},
  {'id': 'WO2007000651-E9',
   'type': 'ONT',
   'text': ['astringent'],
   'offsets': [[690, 700]],
   'normalized': [{'db_name': 'chebi', 'db_id': 'WO2007000651:157583'}]},

Maybe the db_id should just contain the last numerical bit?
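
One possible reading of this suggestion, as a minimal sketch only: keep the trailing run of digits of each identifier and fall back to the stripped input when there is none. The helper names `trailing_numeric_id` and `normalize_chebi_ids` are hypothetical and not part of the PR. Whether an ID such as `WO2007000651:157583` actually refers to a ChEBI entry is still the open question here, so this only illustrates the string handling.

```python
import re
from typing import Dict, List

# Matches the final run of digits in an identifier, e.g. "36560" in "CHEBI:36560".
_TRAILING_DIGITS = re.compile(r"(\d+)\s*$")


def trailing_numeric_id(raw_id: str) -> str:
    """Return the trailing digits of raw_id, or the stripped input if there are none.

    Hypothetical helper for illustration; not taken from the PR.
    """
    match = _TRAILING_DIGITS.search(raw_id)
    return match.group(1) if match else raw_id.strip()


def normalize_chebi_ids(raw_attr: str) -> List[Dict[str, str]]:
    """Split a comma-separated 'chebi-id' attribute into normalized entries."""
    return [
        {"db_name": "chebi", "db_id": trailing_numeric_id(part)}
        for part in raw_attr.split(",")
        if part.strip()  # skip empty fragments such as a trailing comma
    ]
```

With this sketch both example entities above would carry purely numeric `db_id` values, at the cost of dropping the prefix that currently distinguishes `CHEBI:` IDs from the `WO2007000651:` ones.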

bigbio/hub/hub_repos/chebi/chebi.py

+                              "offsets": [[e["start"], e["end"]]],
+                              "type": e["attrs"]["type"],
+                              "normalized": [
+                                  {"db_name": "chebi", "db_id": chebi_id.strip()} for chebi_id in e["attrs"]["chebi-id"].split(",")

Collaborator

phlobo Dec 6, 2024

Moreover, there seem to be more identifiers attached to each entity. For example, the source version contains entries like 'epochem-id': 'EPOCHEM:NEW:CLASS:4'. Shall we include them as additional normalized entries with db_name: epochem? This might be a relevant NED task for some users of the dataset.
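
A rough sketch of what this could look like, assuming the entity attributes arrive as a flat dict of strings as in the `e["attrs"]` access in the diff. The mapping `ID_ATTRS` and the function `build_normalized` are hypothetical names, and using "epochem" as the database name is only the suggestion from this comment, not something the PR implements.

```python
from typing import Dict, List

# Assumed mapping from source attribute name to the db_name used in the
# bigbio schema. Only "chebi-id" and "epochem-id" are mentioned in this thread.
ID_ATTRS = {"chebi-id": "chebi", "epochem-id": "epochem"}


def build_normalized(attrs: Dict[str, str]) -> List[Dict[str, str]]:
    """Collect one normalized entry per identifier found in attrs.

    Attributes may hold several comma-separated IDs; missing attributes
    and empty fragments are skipped.
    """
    normalized = []
    for attr_name, db_name in ID_ATTRS.items():
        for db_id in attrs.get(attr_name, "").split(","):
            db_id = db_id.strip()
            if db_id:
                normalized.append({"db_name": db_name, "db_id": db_id})
    return normalized
```

Keeping the attribute-to-database mapping in one dict would make it easy to add or drop identifier types once the provenance of the annotations is clarified.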


Reviewers

phlobo: requested changes

hakunanatasha: awaiting requested review (code owner)

jason-fries: awaiting requested review (code owner)

sunnnymskang: awaiting requested review (code owner)

ruisi-su: awaiting requested review (code owner)

galtay: awaiting requested review (code owner)

leonweber: awaiting requested review (code owner)

sg-wbi: awaiting requested review (code owner)

debajyotidatta: awaiting requested review (code owner)

Requested changes must be addressed to merge this pull request.
